Reporting Preliminary Automatic Comparable Corpora Compilation Results
Abstract
Translation and translation studies rely heavily on distinctive text resources, such as comparable corpora. Comparable corpora gather a greater diversity of language-dependent phrases than multilingual electronic dictionaries or parallel corpora, and they present a robust language resource. We therefore see comparable corpora compilation as pressing in this technological era and suggest an automatic approach to gathering them. The originality of the research lies in the newly proposed methodology that guides the compilation process. We aim to contribute to the work of translation and translation studies professionals by suggesting an approach to obtaining comparable corpora without intermediate human evaluation. This contribution saves time and provides such professionals with non-static text resources. In our experiment we compare the automatic compilation results to the labels that two human evaluators have given to the
Similar resources
Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection
This paper presents the compilation of the DSL corpus collection created for the DSL (Discriminating Similar Languages) shared task to be held at the VarDial workshop at COLING 2014. The DSL corpus collection was merged from three comparable corpora to provide a suitable dataset for automatic classification to discriminate similar languages and language varieties. Along with the description of...
Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space
Focusing on a systematic Latent Semantic Analysis (LSA) and Machine Learning (ML) approach, this research contributes to the development of a methodology for the automatic compilation of comparable collections of documents. Its originality lies within the delineation of relevant comparability characteristics of similar documents in line with an established definition of comparable corpora. Thes...
Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora
This paper reports the fully automatic compilation of parallel corpora for Brazilian Portuguese. Scientific news texts available in Brazilian Portuguese, English and Spanish are automatically crawled from a multilingual Brazilian magazine. The texts are then automatically aligned at document- and sentence-level. The resulting corpora contain about 2,700 parallel documents totaling over 150,000 al...
Improving Machine Translation Performance Using Comparable Corpora
The overwhelming majority of the languages in the world are spoken by fewer than 50 million native speakers, and automatic translation of many of these languages is less investigated due to the lack of linguistic resources such as parallel corpora. In the ACCURAT project we will work on novel methods by which comparable corpora can compensate for this shortage and improve machine translation systems ...
Publication date: 2013